On Lower Bounds for Regret in Reinforcement Learning

Authors

  • Ian Osband
  • Benjamin Van Roy
Abstract

We consider the problem of learning to optimize an unknown MDP $M^* = (S, A, R^*, P^*)$. $S = \{1, \ldots, S\}$ is the state space and $A = \{1, \ldots, A\}$ is the action space. In each timestep $t = 1, 2, \ldots$ the agent observes a state $s_t \in S$, selects an action $a_t \in A$, receives a reward $r_t \sim R^*(s_t, a_t) \in [0, 1]$, and transitions to a new state $s_{t+1} \sim P^*(s_t, a_t)$. We define all random variables with respect to a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
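
The interaction protocol in the abstract is just a sampling loop. Below is a minimal sketch of that loop in Python, assuming a toy tabular MDP; the names `P_true` and `R_true` and the uniform-random policy are illustrative placeholders, not the paper's construction.

```python
import numpy as np

# Minimal sketch of the agent-environment protocol described above.
# The MDP here is a hypothetical toy instance, not from the paper.
rng = np.random.default_rng(0)
S, A = 3, 2  # number of states and actions

# Unknown transition kernel P*(s, a): a distribution over next states.
P_true = rng.dirichlet(np.ones(S), size=(S, A))
# Unknown mean rewards for R*(s, a); realized rewards lie in [0, 1].
R_true = rng.uniform(size=(S, A))

s = 0  # initial state s_1
total_reward = 0.0
for t in range(1, 101):
    a = rng.integers(A)                     # placeholder policy: uniform
    r = float(rng.random() < R_true[s, a])  # Bernoulli reward in [0, 1]
    s_next = rng.choice(S, p=P_true[s, a])  # s_{t+1} ~ P*(s_t, a_t)
    total_reward += r
    s = s_next

print(f"cumulative reward over 100 steps: {total_reward:.0f}")
```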

Related articles

Near-optimal Regret Bounds for Reinforcement Learning

This technical report is an extended version of [1]. For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ i...

Near-optimal Regret Bounds for Reinforcement Learning

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinfo...
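
The diameter is easiest to grasp on a concrete instance. Here is a small sketch, assuming a hypothetical deterministic four-state MDP (deterministic so expected travel times reduce to shortest paths); for stochastic transitions one would instead minimize expected hitting times over policies. Nothing below comes from the paper itself.

```python
import itertools
import numpy as np

# Hypothetical deterministic 4-state MDP: next_state[s, a] is the
# successor of state s under action a. With deterministic transitions
# the best policy's expected travel time is the shortest path, so the
# diameter D is the largest pairwise shortest-path distance.
next_state = np.array([
    [1, 0],  # from state 0: action 0 -> 1, action 1 -> 0
    [2, 0],
    [3, 1],
    [0, 2],
])
S, A = next_state.shape

def shortest_steps(src: int, dst: int) -> int:
    """Breadth-first search over (state, action) edges.
    Assumes dst is reachable from src (true in this example)."""
    frontier, seen, steps = {src}, {src}, 0
    while dst not in frontier:
        frontier = {int(next_state[s, a])
                    for s in frontier for a in range(A)} - seen
        seen |= frontier
        steps += 1
    return steps

D = max(shortest_steps(s, t)
        for s, t in itertools.permutations(range(S), 2))
print(f"diameter D = {D}")  # -> 3 for this toy MDP
```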

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic on...
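
The upper-confidence-bound principle this abstract invokes is most transparent in the single-state special case, the multi-armed bandit. Below is a minimal UCB1-style sketch with made-up Bernoulli arm means; it is not the UCRL algorithm itself, which lifts the same optimism to full MDPs via confidence sets over rewards and transition probabilities.

```python
import numpy as np

# UCB1-style selection for a 3-armed Bernoulli bandit -- the
# single-state special case of the confidence-bound principle
# behind UCRL. The arm means below are illustrative.
rng = np.random.default_rng(1)
means = np.array([0.2, 0.5, 0.8])
counts = np.zeros(len(means))
sums = np.zeros(len(means))

for t in range(1, 5001):
    if t <= len(means):
        arm = t - 1                   # pull each arm once to initialize
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))     # optimism: act on the upper bound
    reward = float(rng.random() < means[arm])
    counts[arm] += 1
    sums[arm] += reward

# Suboptimal arms are pulled only logarithmically often.
print("pull counts per arm:", counts.astype(int))
```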

Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning

We consider the problem of undiscounted reinforcement learning in continuous state space. Regret bounds in this setting usually hold under various assumptions on the structure of the reward and transition function. Under the assumption that the rewards and transition probabilities are Lipschitz, for a 1-dimensional state space a regret bound of $\tilde{O}(T^{3/4})$ after any T steps has been given by Ortner...
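
The $T^{3/4}$ exponent falls out of a discretize-and-balance argument. A hedged sketch of the arithmetic, suppressing constants, the diameter, and logarithmic factors (the paper's exact expressions may differ):

\[
\underbrace{\tilde{O}\big(n\sqrt{AT}\big)}_{\text{regret of UCRL on the $n$-bin discretization}}
\;+\;
\underbrace{O\big(T/n\big)}_{\text{Lipschitz discretization bias}}
\]

is minimized by choosing $n \approx T^{1/4}$ bins, which makes both terms of order $\tilde{O}(T^{3/4})$.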

Adaptive aggregation for reinforcement learning in average reward Markov decision processes

We present an algorithm which aggregates online when learning to behave optimally in an average reward Markov decision process. The algorithm is based on the reinforcement learning algorithm UCRL and uses confidence intervals for aggregating the state space. We derive bounds on the regret our algorithm suffers with respect to an optimal policy. These bounds are only slightly worse than the orig...
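
The aggregation idea can be sketched as: merge states whose estimates are statistically indistinguishable, i.e. whose confidence intervals overlap. The snippet below is an illustrative reconstruction of that test with Hoeffding-style interval widths, not the paper's actual algorithm; all names are hypothetical.

```python
import numpy as np

# Merge two states when the confidence intervals of their empirical
# reward estimates overlap, so the data cannot distinguish them.
# Widths follow a Hoeffding-style sqrt(log t / n) rule (illustrative).
def intervals(sums, counts, t):
    mean = sums / np.maximum(counts, 1)
    width = np.sqrt(np.log(max(t, 2)) / np.maximum(counts, 1))
    return mean - width, mean + width

def aggregate(sums, counts, t):
    lo, hi = intervals(sums, counts, t)
    groups, current = [], [0]
    for s in range(1, len(sums)):        # states pre-sorted by mean
        prev = current[-1]
        if lo[s] <= hi[prev] and lo[prev] <= hi[s]:
            current.append(s)            # intervals overlap: merge
        else:
            groups.append(current)
            current = [s]
    groups.append(current)
    return groups

sums = np.array([60.0, 70.0, 180.0, 184.0])   # empirical reward sums
counts = np.full(4, 200)                      # visit counts per state
print(aggregate(sums, counts, t=100))         # -> [[0, 1], [2, 3]]
```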


Journal:
  • CoRR

Volume: abs/1608.02732

Published: 2016